ABSTRACT



The string-matching problem considered here is to find all occurrences of a 
given pattern as a substring of another longer string. When the pattern is 
simply a given string of symbols, there is an algorithm due to Morris, Knuth 
and Pratt which has a running time proportional to the total length of the 
pattern and long string together. This time may be achieved even on a 
Turing machine. The more difficult case where either string may have 
"don't care" symbols which are deemed to match with all symbols is also 
considered. By exploiting the formal similarity of string-matching with 
integer multiplication, a new algorithm has been obtained with a running 
time which is only slightly worse than linear. 
1. Introduction.

We consider several problems concerned with the matching of strings
of symbols. A typical practical problem is that we are given a (long)
symbol string $X = X_0 X_1 X_2 \cdots X_m$, the "text", and another
(short) string $Y = Y_0 Y_1 \cdots Y_n$, the "pattern", over the same
finite alphabet $\Sigma$. The task is to find all occurrences of the
pattern as a consecutive substring in the text, that is, to find all
$i$, $n \le i \le m$, such that:

    $Y = [X_{i-n} \cdots X_i]$

The obvious naive algorithm tries each $i$ in turn and compares $Y_j$
with $X_{i-n+j}$ for $j = 0, 1, \ldots$ as far as necessary, and is
represented by the following informal program.

    FOR i = n STEP 1 UNTIL m
        FOR j = 0 STEP 1 UNTIL n
            IF Y[j] ≠ X[i-n+j] GOTO L
        REPEAT
        PRINT (i)
    L: REPEAT

For example with the following strings the desired outputs would be 
4, 7, 12. 



    i:  0 1 2 3 4 5 6 7 8 9 10 11 12
    X = [symbols not recoverable from this copy]
    Y = [symbols not recoverable from this copy]



An upper bound on the computation time for this algorithm is O(m·n), and
the matching of $a^m b$ with $a^n b$ shows that this bound is realistic.
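The informal program can be rendered as a short runnable sketch. The function name `naive_match` and the use of Python strings for X and Y are assumptions made here for illustration; as in the text, the values reported are the end positions i of the matches.

```python
def naive_match(x, y):
    """Return every i with n <= i <= m such that Y equals X[i-n .. i].

    x is the text X_0..X_m and y is the pattern Y_0..Y_n (illustrative names).
    """
    m, n = len(x) - 1, len(y) - 1
    hits = []
    for i in range(n, m + 1):
        # Compare Y_j with X_{i-n+j} for j = 0, 1, ... as far as necessary;
        # all() stops at the first mismatch, like the GOTO in the program.
        if all(y[j] == x[i - n + j] for j in range(n + 1)):
            hits.append(i)
    return hits
```

For instance, `naive_match("abababa", "aba")` reports the end positions 2, 4 and 6.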

2. Morris-Knuth-Pratt algorithm. 

A considerable improvement on the naive procedure described above
is afforded by an algorithm due to J. H. Morris, D. E. Knuth and
V. R. Pratt [4], which has a running time which is O(m + n). The
essential idea is that if we have successfully matched a segment of the
string X with an initial segment of Y before reaching an inequality, then
it is unnecessary and wasteful to read those symbols of X again since
they are the same as the Y segment. A better procedure is to carry out
the first comparisons for the next relative position of the pattern Y
by comparing Y with a segment of itself, and of course these comparisons
can be pre-computed once and for all at the beginning. The
pre-computation required is very quick and has the same general form as
the main computation itself.

We shall describe a "theoreticians' version" of the Morris-Knuth-
Pratt algorithm to simplify the presentation and analysis. For a symbol
string $Z = Z_0 \cdots Z_n$, define the function P for i = 0,...,n by:

    $P(i) = \max \{\, t \mid \bigwedge_{0 \le r \le t} Z_{i-r} = Z_{t-r} \text{ and } -1 \le t < i \,\}$

Provided we consider $\bigwedge_{0 \le r \le -1}$ to be identically true,
P(i) is always well-defined. It is not difficult to verify that:

    $P^{(k)}(i)$ = the k-th largest t such that $\bigwedge_{0 \le r \le t} Z_{i-r} = Z_{t-r}$ and $-1 \le t < i$,

if this is defined.

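($P^{(k)}(i)$ denotes the composition of P with itself k times, so
$P^{(0)}(i) = i$ and $P^{(k+1)}(i) = P(P^{(k)}(i))$.) The usefulness of P
results from the following recursive definition.

    $P(0) = -1$
    $P(i+1) = P^{(k)}(i) + 1$, where $k \ge 1$ is the least value with $Z_{P^{(k)}(i)+1} = Z_{i+1}$,
    $P(i+1) = -1$ if there is no such $k$.
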
Provided that the string Z and the values of P(j) for $j \le i$ are readily
accessible, the value of P(i+1) may be computed in time less than

    $c \cdot (P(i) - P(i+1) + 2)$

for some constant c, independent of i and n. This is because

    $P(j+1) - 1 \le P(j) < j$ for all j

and hence the k of the recursive definition satisfies:

    $P(i) - P(i+1) + 2 \ge k \ge 1$

Therefore the total running time is bounded by:

    $c \cdot (P(0) - P(n)) + 2c(n+1) = O(n)$.

To solve our original problem we concatenate Y, a new symbol @, and
X in that order and compute the values of P for the string Y @ X in time
O(m+n). Because of the @, P can never take a value greater than
n = |Y| - 1. The values of i for which P(i) = n mark the positions where
Y matches a substring of X, or more precisely:

    $P(n+2+i) = n \iff \bigwedge_{0 \le r \le n} X_{i-r} = Y_{n-r}$
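A minimal sketch of this section, assuming 0-indexed Python sequences, may help fix the indexing. The names `prefix_function` and `mkp_match` are chosen here for illustration, and a fresh `object()` plays the role of the new symbol @ because it compares unequal to every text symbol.

```python
def prefix_function(z):
    """P[i] = largest t, -1 <= t < i, such that Z_0..Z_t is a suffix of Z_0..Z_i."""
    p = [-1] * len(z)
    for i in range(len(z) - 1):
        t = p[i]                      # t runs through P(i), P(P(i)), ...
        while t >= 0 and z[t + 1] != z[i + 1]:
            t = p[t]
        # Either the symbols after the border agree, or no border can be extended.
        p[i + 1] = t + 1 if z[t + 1] == z[i + 1] else -1
    return p


def mkp_match(x, y):
    """End positions i of the occurrences of the pattern y in the text x."""
    n = len(y) - 1
    separator = object()              # stands in for '@': equal to no symbol
    p = prefix_function(list(y) + [separator] + list(x))
    # X_i sits at position n + 2 + i of the string Y @ X.
    return [i for i in range(len(x)) if p[n + 2 + i] == n]
```

The inner `while` loop is the search for the k of the recursive definition; its total cost over all i is bounded by the telescoping sum given above.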
3. A Turing machine implementation.

The linear time bound obtained in Section 2 for the Morris-Knuth-
Pratt algorithm seems to depend not only on the use of a random access
machine, but also on the assignment of unit cost to a memory access,
for just the P array alone contains O(n log n) bits when represented
as a sequence of binary integers. This makes a linear time Turing
machine implementation somewhat surprising.

The central economy results from representing the P array by a
table $\Delta$ of differences. Define P(-1) = -1 and let

    $\Delta(i) = 1 + P(i) - P(i+1), \quad -1 \le i < n.$

Then

    $P(i) = i - \sum_{j=-1}^{i-1} \Delta(j)$

and

    $\sum_{j=-1}^{n-1} \Delta(j) = n - P(n) \le n+1,$

so the $\Delta$ array can be represented in linear space, even using unary
notation.
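The two identities can be checked on a small case. The sketch below assumes P is given as a Python list `p` with `p[i] = P(i)`; the helper names and the convention that `delta[j + 1]` stores $\Delta(j)$ are assumptions of this example, and the unary word follows the block layout 0 1^Δ used for tape Y later in this section.

```python
def delta_table(p):
    """Return [Δ(-1), Δ(0), ..., Δ(n-1)] for p = [P(0), ..., P(n)]."""
    ext = [-1] + list(p)              # ext[i + 1] = P(i), with P(-1) = -1
    return [1 + ext[k] - ext[k + 1] for k in range(len(p))]


def p_from_delta(delta):
    """Recover P(0..n) from the differences: P(i) = i - sum of Δ(-1..i-1)."""
    p, total = [], 0
    for i in range(len(delta)):
        total += delta[i]             # total = Δ(-1) + ... + Δ(i-1)
        p.append(i - total)
    return p


def unary_word(delta):
    """Each Δ value written as a '0' followed by that many '1's."""
    return "".join("0" + "1" * d for d in delta)
```

For Z = abab we have P = (-1, -1, 0, 1), so the table is (1, 1, 0, 0), its sum is 2 = n - P(n), and the unary word is 010100.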

We may expand the recursive definition of P in Section 2 as
follows:

Algorithm X.

Stage (0): Set $P(0) \leftarrow -1$. Go to stage (1, 1).

Stage (i+1, k):

1. If $Z_{P^{(k)}(i)+1} = Z_{i+1}$, set $P(i+1) \leftarrow P^{(k)}(i) + 1$ and go to stage (i+2, 1).

2. If $P^{(k)}(i) = -1$, set $P(i+1) \leftarrow -1$ and go to stage (i+2, 1).

3. Otherwise, go to stage (i+1, k+1).

Algorithm X may be rewritten without explicit reference to P by using
the $\Delta$ array and three new variables p, s and d. Inductively, at the
beginning of stage (i+1, k), the variables will satisfy:

    (3.1)   $p = P^{(k)}(i)$
    (3.2)   $s = P^{(k)}(i) - P^{(k+1)}(i)$
    (3.3)   $d = P(i) - P^{(k)}(i)$

Algorithm Y maintains these conditions and computes the $\Delta$ array. (The
column vector notation denotes simultaneous assignment.)

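Algorithm Y.

Stage (0):
    $\Delta(-1) \leftarrow 1$;
    $\begin{pmatrix} p \\ s \\ d \end{pmatrix} \leftarrow \begin{pmatrix} -1 \\ 0 \\ 0 \end{pmatrix}$;
    go to stage (1, 1).

Stage (i+1, k):

1. If $Z_{p+1} = Z_{i+1}$, then begin $\Delta(i) \leftarrow d$;
    $\begin{pmatrix} p \\ s \\ d \end{pmatrix} \leftarrow \begin{pmatrix} p+1 \\ s+\Delta(p) \\ 0 \end{pmatrix}$;
    go to stage (i+2, 1); end.

2. If p = -1, then begin $\Delta(i) \leftarrow d+1$; $d \leftarrow 0$;
    go to stage (i+2, 1); end.

3. Otherwise, begin
    $\begin{pmatrix} p \\ s \\ d \end{pmatrix} \leftarrow \begin{pmatrix} p-s \\ s - \sum_{j=p-s}^{p-1} \Delta_j \\ d+s \end{pmatrix}$;
    go to stage (i+1, k+1); end.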



It may be readily verified that conditions (3.1)-(3.3) hold after stage (0) , 
are preserved by the remaining stages, and the A's are computed correctly. 
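The three cases of Algorithm Y can be traced with a short sketch. It assumes the list `delta` stores $\Delta(j)$ at index j + 1, and it resets d to 0 in case 2, which is what condition (3.3) requires at the next stage; the function name `algorithm_y` is an assumption of this example.

```python
def algorithm_y(z):
    """Compute [Δ(-1), ..., Δ(n-1)] for z = Z_0..Z_n without storing P."""
    n = len(z) - 1
    delta = [1]                       # stage (0): Δ(-1) <- 1
    p, s, d = -1, 0, 0
    i = 0
    while i < n:                      # each pass is one stage (i+1, k)
        if z[p + 1] == z[i + 1]:      # case 1: P(i+1) = p + 1
            delta.append(d)           # Δ(i) <- d
            # Right-hand side uses the old p, so delta[p + 1] is Δ(p).
            p, s, d = p + 1, s + delta[p + 1], 0
            i += 1
        elif p == -1:                 # case 2: P(i+1) = -1
            delta.append(d + 1)       # Δ(i) <- d + 1
            d = 0
            i += 1
        else:                         # case 3: move from P^(k)(i) to P^(k+1)(i)
            drop = sum(delta[p - s + 1 : p + 1])   # Δ(p-s) + ... + Δ(p-1)
            p, s, d = p - s, s - drop, d + s
    return delta
```

On Z = aab the stages visit case 1, case 3 and case 2 in turn and produce the table (1, 0, 2), which agrees with P = (-1, 0, -1).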

The Turing machine to implement Algorithm Y has four tapes, each
one-way infinite to the right. The input tape Z has two heads A and B.
Tape Y has two heads C and D and holds the $\Delta$'s and d. p is represented
by the positions of heads A and C. Tape S is used as a counter and holds
s. Tape T is a scratch tape. The tapes with two heads may be replaced
without time loss by several tapes with only one head per tape [3].

At the start of stage (i+1, k), head A is scanning $Z_p$ and head B is
scanning $Z_i$. Tape Y contains the binary word

    $0\,1^{\Delta(-1)}\,0\,1^{\Delta(0)}\,0\,1^{\Delta(1)} \cdots 0\,1^{\Delta(i-1)}\,0\,1^{d}$,

head C is on the "0" immediately preceding the block $1^{\Delta(p)}$ (the
(p+2)nd "0" from the left), and head D is on the rightmost non-blank
square. Finally, the counter S contains the number s.



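Below are the Turing machine tapes of the example of Section 2 at the
beginning of stage (6,2).

    [Figure: contents of Tape Z, Tape Y and Tape S with the head positions
    at the beginning of stage (6,2); the tape contents are not recoverable
    from this copy.]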







We now examine the operations that might be required in stage (i+1, k).
The first test, "$Z_{p+1} = Z_{i+1}$", is accomplished in three Turing
machine steps, for heads A and B are only one square away from the
symbols $Z_{p+1}$ and $Z_{i+1}$, respectively. Similarly, the test "p = -1"
becomes a test to see if head A is scanning the left endmarker, "$".

The updating in case (1) is accomplished by shifting D right and
printing a "0", shifting heads A and B right one square, and moving C
right to the next "0". As C is advanced, S is incremented once for each
"1" that C passes over.

The updating in case (2) is even easier. A and C are left alone, B
moves right one square, and D moves right two squares, printing a "1"
followed by a "0".

To accomplish case (3), head C is moved left over s zeros. For each "0"
passed over by C, head A moves one square to the left and head D moves one
square to the right and prints a "1". For each "1" passed over by C, S
is decremented. Since the counter S is modified by this process, its contents
are first copied into the temporary counter T which is then used to control
the iteration.

We total up separately the time spent in each of the three cases.
For each i, case (2) is executed during at most one of the stages (i+1, k).
Each such execution takes a constant amount of time, so the total over all
stages is clearly O(n).

When case (3) is executed at stage (i+1, k), it takes time cs for some
constant c, where $s = P^{(k)}(i) - P^{(k+1)}(i)$ is the value of S at the
start of the stage, for $\sum_{j=p-s}^{p-1} \Delta_j \le s$. Let $k_i$ be the
largest value of k for which a stage (i+1, k) is executed. Then stages
(i+1, 1), ..., (i+1, $k_i - 1$) all execute case (3) and stage (i+1, $k_i$)
executes case (1) or (2). Hence, the total time spent in case (3) from
the start of stage (i+1, 1) to the start of stage (i+2, 1) is

    $\sum_{k=1}^{k_i - 1} c\,(P^{(k)}(i) - P^{(k+1)}(i)) = c\,(P(i) - P^{(k_i)}(i)) \le c\,(P(i) - P(i+1) + 1).$

Summing over all i, the total time in case (3) is O(n).
Finally, the time spent in case (1) is bounded by the number of times
C is shifted right. But this is at most the eventual length $\ell_Y$ of tape
Y plus the number of times C is shifted left. The latter occurs only in
case (3) and hence is bounded by O(n). Since

    $\ell_Y = n + 2 + \sum_{j=-1}^{n-1} \Delta_j \le 2n + 3 = O(n),$

the total time spent in case (1) is also O(n).

It follows that the total time of the Turing machine is O(n).

4. "Don't care" symbols. 

An interesting extension of this simple string-matching problem
which has practical applications results from the introduction of a
"don't care" symbol, $\phi$, into the alphabet. $\phi$ has the property of
"matching" with any symbol. We shall write "$\equiv$" for this matching, so

    $\phi \equiv x$ for all $x \in \Sigma \cup \{\phi\}$

The Morris-Knuth-Pratt algorithm breaks down in this situation,
basically because "$\equiv$" is not a transitive relation, that is:

    $x \equiv y \,\land\, y \equiv z \;\not\Rightarrow\; x \equiv z$

The above implication is valid only if $y \ne \phi$. Transitivity was assumed
implicitly in deriving the recursive relation used to compute P. The
naive algorithm given initially works just as before, with "$\equiv$" in place
of "=". The ostensible aim of this paper is to produce a more efficient
algorithm for string-matching with "don't care" symbols. This is
achieved only for the case when $\Sigma$ is a finite alphabet.

5. Generalised linear products. 

Both of the string-matching problems described so far can be regarded
as special cases of a very general "linear product". Given two vectors
of elements, $X = X_0, \ldots, X_m$ and $Y = Y_0, \ldots, Y_n$, the linear
product with respect to $\otimes$ and $\oplus$, written
$X \stackrel{\oplus}{\otimes} Y$, is a vector $Z = Z_0, \ldots, Z_{m+n}$ where:

    $Z_k = \bigoplus_{i+j=k} X_i \otimes Y_j$   for k = 0,..., m+n

For this to be meaningful, $X_i, Y_j \in D$, $Z_k \in E$, for some sets D, E,
and $\otimes$, $\oplus$ are functions,

    $\otimes : D \times D \to E$
    $\oplus : E \times E \to E$, $\oplus$ associative
If $\oplus$ is $\land$, and $\otimes$ is = or $\equiv$, the middle m-n+1 truth values of
the linear product give the information required in matching the text X
against the reversal of Y, that is $Y_n \cdots Y_0$, since

    $Z_k = \bigwedge_{0 \le j \le n} X_{k-j} \otimes Y_j$

for $n \le k \le m$. The reason for introducing general linear products
here lies in the following two cases.

    (i) Boolean product, where $\oplus$ is $\lor$ and $\otimes$ is $\land$, and
    (ii) polynomial product, where $\oplus$ is + and $\otimes$ is $\times$.

The polynomial product is of course the ordinary multiplication of
polynomials. The four products with which we are principally concerned
are illustrated in Figure 1.
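The definition translates directly into a generic routine that takes the two operations as parameters. This is only a quadratic-time sketch of the definition itself; the names `linear_product`, `times`, `plus` and the use of "?" for the "don't care" symbol are assumptions of this example.

```python
import operator
from functools import reduce


def linear_product(x, y, times, plus):
    """Z_k = plus-combination of times(X_i, Y_j) over all i + j = k, k = 0..m+n."""
    m, n = len(x) - 1, len(y) - 1
    z = []
    for k in range(m + n + 1):
        # Every k in 0..m+n has at least one pair (i, j) with i + j = k.
        terms = [times(x[i], y[k - i])
                 for i in range(max(0, k - n), min(m, k) + 1)]
        z.append(reduce(plus, terms))
    return z


PHI = "?"  # illustrative stand-in for the "don't care" symbol


def matches(a, b):
    """The relation written with three bars in the text: equal, or either is PHI."""
    return a == PHI or b == PHI or a == b


string_product = linear_product("baaba", "baa", operator.eq, operator.and_)
phi_product = linear_product("ab??a", "a?b", matches, operator.and_)
boolean_product = linear_product([1, 0, 0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1],
                                 operator.and_, operator.or_)
polynomial_product = linear_product([1, 0, 0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1],
                                    operator.mul, operator.add)
```

The four calls at the end correspond to the four products named in the text: plain string product, "don't care" string product, Boolean product and polynomial product.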

6. Algorithms for linear products. 

For the simple string products the Morris-Knuth-Pratt algorithm
can be extended to yield the complete linear product. If we append a
string of n $\phi$'s to the end of the text X, then the same algorithm
correctly computes the last n truth-values of the linear product. We
are thus computing the values of P for the string $Y^R\,@\,X\,\phi^n$. For the
first n elements of the product, we know of no better method than to
reverse both strings and use the same procedure.

For strings over a finite alphabet with "don't cares", we follow
an indirect course, showing first that the computation time for string
product is of the same order as that for Boolean product. If $\sigma$, $\tau$ are
two distinct symbols of $\Sigma$, and X contains only $\sigma$'s and $\phi$'s while Y
contains only $\tau$'s and $\phi$'s, then the string product of X and Y is
precisely the negation of the Boolean product of the strings $\hat X$ and
$\hat Y$, where
LINEAR PRODUCT   $Z = X \stackrel{\oplus}{\otimes} Y$,   $Z_k = \bigoplus_{i+j=k} X_i \otimes Y_j$

Examples, with the convention that 1, 0 represent true, false respectively.

         $\oplus$   $\otimes$     X              Y          Z
    (1)  $\land$   =      b a a b a      b a a      1001001
    (2)  $\land$   $\equiv$      a b φ φ a      a φ b      1001110
    (3)  $\lor$   $\land$      10010101       10101      101111010101
    (4)  +   $\times$      10010101       10101      101112030201

Figure 1.

    $\hat X_i = \text{true} \iff X_i = \sigma$,     $\hat Y_j = \text{true} \iff Y_j = \tau$,

since

    $\bigwedge_{i+j=k} X_i \equiv Y_j \iff \bigwedge_{i+j=k} \neg\hat X_i \lor \neg\hat Y_j \iff \neg \bigvee_{i+j=k} \hat X_i \land \hat Y_j.$

Thus Boolean product is no harder than $\phi$-string product. On the other
hand, let $H_\rho$ be the predicate on $\Sigma \cup \{\phi\}$ defined by:

    $H_\rho(x) = 1$ if $x = \rho$
    $H_\rho(x) = 0$ if $x \ne \rho$ (or $x = \phi$)

and extend $H_\rho$ to strings in the obvious way. Then

    $Z = X \stackrel{\land}{\equiv} Y = \neg \bigvee_{\sigma \ne \tau} H_\sigma(X) \stackrel{\lor}{\land} H_\tau(Y)$

Informally, this equation states that X and Y match in a given
relative position if and only if there is no pair of distinct symbols
$\sigma, \tau \in \Sigma$ which clash. Hence the $\phi$-string product takes the same time
as the Boolean product to within a constant factor, independent of m
and n.
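The reduction can be illustrated as follows. The sketch computes each Boolean product directly from its definition, so it shows the structure of the reduction rather than any speed-up; the names `boolean_product`, `indicator`, `phi_product` and the symbol "?" for the "don't care" are assumptions of this example.

```python
from itertools import permutations

PHI = "?"  # illustrative stand-in for the "don't care" symbol


def boolean_product(a, b):
    """Z_k = OR over i + j = k of (a_i AND b_j), computed directly."""
    z = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                if bj:
                    z[i + j] = 1
    return z


def indicator(rho, s):
    """H_rho extended to strings: 1 where the symbol is rho, 0 elsewhere."""
    return [1 if c == rho else 0 for c in s]


def phi_product(x, y, alphabet):
    """Negation of the OR of Boolean products over all ordered pairs sigma != tau."""
    clash = [0] * (len(x) + len(y) - 1)
    for sigma, tau in permutations(alphabet, 2):
        bp = boolean_product(indicator(sigma, x), indicator(tau, y))
        clash = [c | b for c, b in zip(clash, bp)]
    return [1 - c for c in clash]
```

With the alphabet {a, b}, `phi_product("ab??a", "a?b", "ab")` uses two Boolean products, one for each ordered pair of distinct symbols.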

There is a considerable similarity between the Boolean product and
the polynomial product over the integers, as is shown in the example above.
When 1 and 0 are identified with true and false respectively, the Boolean
product can be obtained by performing the polynomial product and then by
replacing any non-zero element by 1. This idea of embedding a Boolean
algebra in a ring for computational purposes has been exploited to achieve
a fast Boolean matrix multiplication and transitive closure algorithm
[1].
One very convenient way to compute the polynomial product is to
embed the product in a single large integer multiplication, for which
there are a variety of well-known efficient algorithms. For the
polynomial product of the {0,1}-strings $X_0, \ldots, X_m$ and
$Y_0, \ldots, Y_n$, where $m \ge n$, the maximum possible coefficient in the
product is n + 1. If we choose r so that $2^r > n + 1$, compute the integers

    $X(2^r) = \sum_{i=0}^{m} X_i \cdot 2^{ri}$   and   $Y(2^r) = \sum_{j=0}^{n} Y_j \cdot 2^{rj}$

and then multiply $X(2^r)$ by $Y(2^r)$, the result will be the product
polynomial Z, evaluated at $2^r$. Successive blocks of length r in the
binary representation of $Z(2^r)$ will give the coefficients of Z, and
by replacing non-zero coefficients by 1 we obtain the elements of the
Boolean product. This is illustrated below.

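    $r > \log_2 (n+1)$,  $m \ge n$

        [0...0 X_m]  ...  [0...0 X_1] [0...0 X_0]
      × [0...0 Y_n]  ...  [0...0 Y_1] [0...0 Y_0]
      = [Z_{m+n}]    ...  [Z_1]       [Z_0]

where each bracketed block is r bits wide, $Z = X \stackrel{+}{\times} Y$, and
$\stackrel{+}{\times}$ is polynomial product.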
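Python's built-in integers have arbitrary precision, so the embedding can be tried out directly. The sketch below assumes the inputs are lists of 0s and 1s with coefficient 0 first; the function names are chosen here for illustration, and r is taken from the shorter operand so that the restriction $m \ge n$ is not needed.

```python
def polynomial_product_via_integer(x, y):
    """Coefficients of the product of two {0,1}-polynomials via one multiplication."""
    m, n = len(x) - 1, len(y) - 1
    # Smallest r with 2**r > min(m, n) + 1, the largest possible coefficient.
    r = (min(m, n) + 1).bit_length()
    big_x = sum(bit << (r * i) for i, bit in enumerate(x))   # X(2^r)
    big_y = sum(bit << (r * j) for j, bit in enumerate(y))   # Y(2^r)
    big_z = big_x * big_y                                    # Z(2^r)
    mask = (1 << r) - 1
    # Successive r-bit blocks of Z(2^r) are the coefficients Z_0, Z_1, ...
    return [(big_z >> (r * k)) & mask for k in range(m + n + 1)]


def boolean_product_via_integer(x, y):
    """Boolean product: replace every non-zero coefficient by 1."""
    return [1 if c else 0 for c in polynomial_product_via_integer(x, y)]
```

For X = 10010101 and Y = 10101 the blocks are 3 bits wide, since the largest possible coefficient is 5.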



The operations required to construct $X(2^r)$ and $Y(2^r)$, and to pick
out the coefficients of Z are very easy and efficient on a binary
computer. On most computers there is fast special-purpose hardware for
multiplication of integers up to a certain size, and efficient routines
for multiplying larger integers. These may be used to yield a good
practical program for the Boolean product of strings of moderate length,
which however has a running time that is still proportional to mn.

For truly large integers, the Schönhage-Strassen algorithm [5]
multiplies M-digit numbers by N-digit numbers in a time which is
$O(M \cdot \log N \cdot \log\log N)$ for $M \ge N$, using a multi-tape Turing
machine. For our application, M = mr = O(m log n) and N = nr = O(n log n).
Hence,

Result. For a finite alphabet, the "don't care" product of strings
of lengths m and n, ($m \ge n$), can be computed with a multitape
Turing machine in time $O(m \cdot (\log n)^2 \cdot \log\log n)$.

7. Large alphabets and numbers of comparisons.

The algorithm for $\phi$-product described so far has the disadvantage
that the running time increases rapidly with the size of the alphabet $\Sigma$.
It is approximately proportional to $|\Sigma|^2$. By coding the symbols of $\Sigma$
into a binary alphabet we can use just two Boolean products for strings
of length $m \cdot \log|\Sigma|$ and $n \cdot \log|\Sigma|$. Provided $|\Sigma|$ is bounded by a
power of n, this introduces a factor of just $\log|\Sigma|$ into the running time.

It is interesting to observe that the Morris-Knuth-Pratt algorithm
works for an infinite alphabet, provided we take the predicate "=" as a
basic operation. Our algorithm for $\phi$-product is not of this form and
we may ask whether there is any algorithm, with access to the strings
only through the predicate "$\equiv$", which has a computation time better than
the obvious O(m·n). Under such a strict limitation the answer is "no",
and this is easily seen by considering the product of the two strings
$X = \phi^{m+1}$ and $Y = \phi^{n+1}$. All $\equiv$-tests have the result true, but
suppose that during the execution of some algorithm there is some test
"$X_i \equiv Y_j$ ?" which is never made. The computation and output would be
indistinguishable from that for the pair of strings $\phi^i \sigma \phi^{m-i}$ and
$\phi^j \tau \phi^{n-j}$, where $\sigma, \tau \in \Sigma$ and $\sigma \ne \tau$, and therefore the
algorithm cannot correctly compute the string product. Hence any
$\phi$-product algorithm of this class must sometimes make at least
(m+1)(n+1) tests.

The above restriction is perhaps a little severe, even if we
consider the case of infinite $\Sigma$, so let us allow in addition an explicit
test for the "don't care" symbol, that is "$X_i = \phi$ ?" or "$Y_j = \phi$ ?". The
lower bound on the number of tests is now radically different, for we
can show that O(m+n) are sufficient. Unfortunately we still know of
no algorithm with a total running time less than that of the naive
algorithm for the $\phi$-product over an infinite alphabet.

8. Algorithm for $\phi$-product using O(m+n) tests.

We have to evaluate the (m+n+1) conjunctions

    $Z_k = \bigwedge_{i+j=k} X_i \equiv Y_j$   for k = 0,..., m+n

Firstly we determine all occurrences of $\phi$ in X and Y, and replace by
true any equivalence involving $\phi$. Possibly some of the $Z_k$'s may thus
be determined. So far as we know the remaining symbols in X and Y may
be completely distinct. At each stage of the algorithm we shall
maintain an equivalence relation on these symbols, such that we have
determined that all symbols in the same equivalence class are identical.
We can always choose "$X_i = Y_j$ ?" for our next comparison, where $X_i$ and
$Y_j$ are in distinct equivalence classes, and $Z_{i+j}$ has yet to be
determined. If $X_i = Y_j$, then the equivalence classes of $X_i$ and $Y_j$ can
be united, whereas if $X_i \ne Y_j$ then $Z_{i+j}$ can be determined as false. If
during the course of the algorithm the former case occurs (m+n+1) times
then only one equivalence class remains, and if the latter case occurs
(m+n+1) times then all the $Z_k$'s have been determined. Either way, no
further comparisons are required. Hence at most 2(m+n+1) equality
tests and m+n+2 $\phi$-tests are needed, giving a total which is O(m+n).
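The counting argument can be made concrete with a union-find structure for the equivalence classes. The sketch below is one possible way to schedule the comparisons, not necessarily the order intended in the text; it still scans all pairs (i, j), so only the number of equality tests is O(m+n), not the running time. The function name and the symbol "?" for the "don't care" are assumptions of this example.

```python
def phi_product_few_tests(x, y, phi="?"):
    """Return (Z, tests): the phi-product and the number of equality tests used."""
    m, n = len(x) - 1, len(y) - 1
    # One phi-test per symbol: m + n + 2 in total.
    x_phi = [c == phi for c in x]
    y_phi = [c == phi for c in y]
    parent = list(range(m + n + 2))   # X_i is node i, Y_j is node m + 1 + j

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    tests = 0
    z = []
    for k in range(m + n + 1):
        ok = True
        for i in range(max(0, k - n), min(m, k) + 1):
            j = k - i
            if x_phi[i] or y_phi[j]:
                continue              # equivalence involving phi is true
            a, b = find(i), find(m + 1 + j)
            if a == b:
                continue              # already known to be identical
            tests += 1
            if x[i] == y[j]:
                parent[a] = b         # unite the two classes
            else:
                ok = False            # Z_k is determined as false
                break
        z.append(1 if ok else 0)
    return z, tests
```

Each successful test merges two classes and each failed test settles one $Z_k$, so `tests` never exceeds 2(m+n+1).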

9. On-line palindromes. 

A computation is performed on-line if the i-th output symbol is produced
before the (i+1)-st input symbol is read. Let $Z_i = 1$ if $X_0 \cdots X_i$ is a
palindrome (i.e. if $X_0 \cdots X_i = X_i \cdots X_0$), and $Z_i = 0$ otherwise. Then

    $Z = X \stackrel{\land}{=} X$

so Z can be computed in time O(n), even on a Turing machine, as outlined in
Sections 3 and 6 using the Morris-Knuth-Pratt algorithm.
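An off-line linear-time computation of Z can be sketched with the function P of Section 2. The sketch applies P to the string X @ X^R and follows the chain P, P(P), ... from its last position; this is a standard variant chosen here for brevity and may differ in detail from the construction of Section 6. The names are assumptions of this example.

```python
def prefix_function(z):
    """P[i] = largest t, -1 <= t < i, such that Z_0..Z_t is a suffix of Z_0..Z_i."""
    p = [-1] * len(z)
    for i in range(len(z) - 1):
        t = p[i]
        while t >= 0 and z[t + 1] != z[i + 1]:
            t = p[t]
        p[i + 1] = t + 1 if z[t + 1] == z[i + 1] else -1
    return p


def palindromic_prefixes(x):
    """Z_i = 1 if X_0..X_i is a palindrome, and 0 otherwise."""
    if not x:
        return []
    separator = object()              # stands in for '@': equal to no symbol
    w = list(x) + [separator] + list(reversed(x))
    p = prefix_function(w)
    z = [0] * len(x)
    # A prefix X_0..X_t that is also a suffix of the reversed string is a palindrome.
    t = p[len(w) - 1]
    while t >= 0:
        z[t] = 1
        t = p[t]
    return z
```

For X = abacaba the palindromic prefixes end at positions 0, 2 and 6.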

Fischer and Stockmeyer [2] present a general procedure for converting
any off-line multiplication algorithm which runs in time T(n) to an on-line
method taking time O(T(n) log n) when T satisfies $T(2n) \ge 2T(n)$. Their
construction applies to any generalised linear product, so in particular, the
time O(n) method above for computing Z can be converted to an on-line Turing
machine program that runs in time O(n log n).
10. Conclusions and open problems.

We have considered string-matching problems with and without a
"don't care" symbol. In both cases a naive procedure, based directly
on the problem definition, takes time proportional to m × n where m, n
are the lengths of the two strings to be matched. The Morris-Knuth-
Pratt algorithm provides a practical and elegant way to solve the
former problem in O(m+n) time, but there seems to be no obvious
extension of their algorithm to the "don't care" case. This is partially
explained by our lower bound result which shows that O(mn) is the best
possible bound unless more information is allowed than the mere results
of comparisons between pairs of symbols. With a further basic test
which explicitly detects the "don't care" symbol, this lower bound
collapses and there is at least the possibility of a faster algorithm.
Provided that the symbol alphabet is finite, we have demonstrated an
algorithm with a running time which is $O(m \cdot (\log n)^2 \cdot \log\log n)$.
The method is indirect and not of practical value except for very large
m and n; however, it shows the feasibility of algorithms which are faster
than the naive procedure for "don't care" matching.

We have not treated at all the superficially similar problem, where
a "don't care" symbol can "match" an arbitrary string of symbols. A
good algorithm for this would have obvious practical applications.

We have only begun to compare and contrast the computational
complexity of generalised linear products for various $\otimes$ and $\oplus$. There
are several more interesting structures for which the linear product
is a natural operation. A study of algorithms for linear products,
based on the axiomatic properties of $\otimes$ and $\oplus$, may provide valuable
insight into why some products are easier than others.